{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.core.interactiveshell import InteractiveShell\n",
    "InteractiveShell.ast_node_interactivity = 'all'  # default is ‘last_expr'\n",
    "\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.append('/Users/siyuyang/Source/repos/GitHub_MSFT/CameraTraps')  # append this repo to PYTHONPATH"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "from collections import Counter, defaultdict\n",
    "from random import sample\n",
    "import math\n",
    "\n",
    "import pandas as pd\n",
    "from tqdm import tqdm\n",
    "\n",
    "from data_management.megadb.schema import sequences_schema_check\n",
    "from data_management.annotations.add_bounding_boxes_to_megadb import *\n",
    "from data_management.megadb.converters.cct_to_megadb import make_cct_embedded, process_sequences, write_json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# IDFG SWWLF 2019\n",
    "\n",
    "#### Notes on swwlf2019_pic_dat.csv from Sarah\n",
    " \n",
    "- We are converting data so that there is a row for every image, and if the image has more than 1 species, there is more than 1 row. If the image has no animals/humans, then the species is labeled “none” and there is just that 1 row.\n",
    " \n",
    "- In cases where there was a human, riding a horse, with a dog and cattle in 1 image (e.g.) – our labels get messy and are not always perfect. Anything more than 2 species and our system falls apart a little. This is incredibly rare with actual wildlife, so we weren’t super concerned, but definitely combinations of human-pets-livestock are a bit less tidy.\n",
    " \n",
    "Fields:\n",
    "\n",
    "File: unique id that matches the name of the image\n",
    "\n",
    "Opstate: refers to the status of the camera. Normal = functioning as expected J. Once a camera is noted as “severely misdirected”, we often stopped labeling images. “maintenance” is usually set up and take down. These are often odd, close up images of humans. If I were you, I’d only look at images where OpState = “normal”.\n",
    "\n",
    "Date\n",
    "\n",
    "Time\n",
    "\n",
    "Pic__CamID = camera id.\n",
    "\n",
    "Trigger mode: M = Motion, T = Time, others are usually errors (C and U, I think, appear when a camera is in an error mode).\n",
    "\n",
    "NearFar refers to if the animal was close or far from the camera (far animals aren’t counted). Again, I would possibly focus on “Near” instances as Far are often not captured by a motion trigger, and might be sort of rare in normal image sets.\n",
    "\n",
    "Species = the species observed in the image\n",
    "\n",
    "Count = number of individuals of that species.\n",
    "\n",
    "Location relative to SWWLF – should give a folder location within the SWWLF2019 folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "path_to_output = '/Users/siyuyang/OneDrive - Microsoft/AI4Earth/CameraTrap/Databases/megadb_2020/idfg_swwlf_2019_megadb.json'  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Name of the dataset**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_name = 'idfg_swwlf_2019'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 0 - Add an entry to the `datasets` table\n",
    "\n",
    "done"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 - Prepare the `sequence` objects to insert into the database"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1b - If you're starting from scratch..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "csv_path = '/Users/siyuyang/Source/temp_data/CameraTrap/engagements/IDFG/swwlf2019_pic_dat.csv'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/siyuyang/anaconda/envs/cameratraps/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3063: DtypeWarning: Columns (16,20) have mixed types.Specify dtype option on import or set low_memory=False.\n",
      "  interactivity=interactivity, compiler=compiler, result=result)\n",
      "/Users/siyuyang/anaconda/envs/cameratraps/lib/python3.6/site-packages/numpy/lib/arraysetops.py:569: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison\n",
      "  mask |= (ar1 == a)\n"
     ]
    }
   ],
   "source": [
    "timelapse_df = pd.read_csv(csv_path, index_col=0, header=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "11717080"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(timelapse_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "File                            object\n",
       "OpState                         object\n",
       "Processor                       object\n",
       "Date                            object\n",
       "Time                            object\n",
       "posix_date_time                 object\n",
       "Pic_CamID                       object\n",
       "TriggerMode                     object\n",
       "Animal                            bool\n",
       "Livestock                         bool\n",
       "Human                             bool\n",
       "Empty                             bool\n",
       "YoungPresent                      bool\n",
       "MarkedAnimal                      bool\n",
       "IllnessUnusual                    bool\n",
       "Comments                        object\n",
       "NearFar                         object\n",
       "Species                         object\n",
       "Count                          float64\n",
       "SPcomment                       object\n",
       "Project                         object\n",
       "Location_Relative_SWWLF2019     object\n",
       "dtype: object"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "timelapse_df.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['SWWLF2019'], dtype=object)"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "timelapse_df['Project'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "timelapse_df = timelapse_df.drop(columns=['File', 'Processor', 'Date', 'Time', 'SPcomment', 'Project'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "timelapse_df.sample(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['maintenance', 'normal', 'malfunction', 'minorly misdirected',\n",
       "       'completely obscured', 'severely misdirected',\n",
       "       'partially obscured'], dtype=object)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "timelapse_df['OpState'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['M', 'T', 'U', 'C'], dtype=object)"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "timelapse_df['TriggerMode'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "one_cam = timelapse_df[timelapse_df['Pic_CamID'] == 'IDFG2637']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(one_cam)\n",
    "one_cam.sample(10)  # R6/GMU60/F_2139/IDFG2637_20190615_125437_MD_1.JPG\n",
    "# looks like the path prefix/folder is the same for the same camera, so cam ID is location ID"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compile images and sequences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "# 55 minutes to run this\n",
    "\n",
    "# need to consolidate rows into image entries\n",
    "\n",
    "im_to_classes = defaultdict(list)\n",
    "im_to_count = defaultdict(float)  # we sum the count across diff species on each image\n",
    "im_has_young = []  # to be made a set\n",
    "all_species = set()\n",
    "\n",
    "other_attributes = {}  # attributes that do not need to reconcile among rows\n",
    "\n",
    "for i_row, row in tqdm(timelapse_df.iterrows()):\n",
    "    im_path = row['Location_Relative_SWWLF2019']\n",
    "    \n",
    "    im_to_count[im_path] += row['Count']\n",
    "    \n",
    "    row_species = row['Species']\n",
    "    if row_species != 'none':\n",
    "        im_to_classes[im_path].append(row['Species'])\n",
    "        all_species.add(row['Species'])\n",
    "    \n",
    "    if row['YoungPresent'] is True:\n",
    "        im_has_young.append(im_path)  # mark as true if any species has young present\n",
    "    \n",
    "    if im_path not in other_attributes:\n",
    "        other_attributes[im_path] = {\n",
    "            'datetime': row['posix_date_time'],\n",
    "            'location': row['Pic_CamID'],\n",
    "            'trigger': row['TriggerMode'],\n",
    "            'op_state': row['OpState']\n",
    "        }\n",
    "    \n",
    "#     if i_row > 10000:\n",
    "#         break\n",
    "\n",
    "im_has_young = set(im_has_young)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "11686098"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "1161781"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "11686098"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "43"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(other_attributes)  # num of images\n",
    "len(im_to_classes)\n",
    "len(im_to_count)\n",
    "len(all_species)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Motion-triggered have sequence number and are named as\n",
    "```\n",
    "R2/GMU10A/F_653/IDFG2426_20190627_103229_MD_3.JPG\n",
    "```\n",
    "where the number after `MD_` is the sequence number\n",
    "\n",
    "Time-triggered are named as\n",
    "```\n",
    "R2/GMU10A/F_653/IDFG2426_20190628_231000_TL_0.JPG\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "im_to_count_int = {}\n",
    "\n",
    "for im_path, animal_count in im_to_count.items():\n",
    "    if not math.isnan(animal_count):\n",
    "        im_to_count_int[im_path] = int(animal_count)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [],
   "source": [
    "del im_to_count"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 11686098/11686098 [01:12<00:00, 161043.62it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1min 4s, sys: 6.4 s, total: 1min 11s\n",
      "Wall time: 1min 12s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "#c = 0\n",
    "\n",
    "for im_path, im_attributes in tqdm(other_attributes.items()):\n",
    "    p = os.path.basename(im_path).split('_')\n",
    "    supposed_trigger_type = p[-2]  # MD or TL\n",
    "    seq_id = '_'.join(im_path.split('_')[:-2])\n",
    "    \n",
    "    if supposed_trigger_type == 'TL':\n",
    "        seq_id = 'dummy_' + seq_id\n",
    "        frame_num = None\n",
    "    else:\n",
    "        frame_num = p[-1].split('.')[0]\n",
    "        if frame_num.endswith('b'):\n",
    "            frame_num = None\n",
    "        else:\n",
    "            frame_num = int(frame_num)\n",
    "        #print(frame_num)\n",
    "    \n",
    "    im_classes = im_to_classes.get(im_path, None)\n",
    "    if im_classes is None:\n",
    "        if im_attributes['op_state'] == 'severely misdirected':\n",
    "            im_classes = ['__label_unavailable']\n",
    "        else:\n",
    "            im_classes = ['empty']\n",
    "    else:\n",
    "        im_classes = list(set(im_classes))\n",
    "    \n",
    "    count = im_to_count_int.get(im_path, None)\n",
    "\n",
    "#     print(im_path)\n",
    "#     print(supposed_trigger_type)\n",
    "#     print(seq_id)\n",
    "#     print(im_classes)\n",
    "#     print(count)\n",
    "#     print()\n",
    "#     c += 1\n",
    "#     if c > 1000:\n",
    "#         break\n",
    "\n",
    "    im_attributes['file'] = im_path\n",
    "    \n",
    "    im_attributes['seq_id'] = seq_id\n",
    "    if frame_num is not None:\n",
    "        im_attributes['frame_num'] = frame_num\n",
    "    \n",
    "    im_attributes['class'] = im_classes\n",
    "\n",
    "    if count is not None:\n",
    "        im_attributes['count'] = count"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [],
   "source": [
    "embedded = list(other_attributes.values())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [],
   "source": [
    "del other_attributes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The dataset_name is set to idfg_swwlf_2019. Please make sure this is correct!\n",
      "Making a deep copy of docs...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0%|          | 26218/11686098 [00:00<00:44, 262109.63it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Putting 11686098 images into sequences...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 11686098/11686098 [02:32<00:00, 76750.68it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of sequences: 10959983\n",
      "Checking the location field...\n",
      "Checking which fields in a CCT image entry are sequence-level...\n",
      "\n",
      "all_img_properties\n",
      "{'frame_num', 'trigger', 'class', 'datetime', 'op_state', 'file', 'count', 'location'}\n",
      "\n",
      "img_level_properties\n",
      "{'frame_num', 'class', 'datetime', 'op_state', 'file', 'count'}\n",
      "\n",
      "image-level properties that really should be sequence-level\n",
      "{'trigger', 'location'}\n",
      "\n",
      "Finished processing sequences.\n",
      "Example sequence items:\n",
      "\n",
      "{'seq_id': 'R7/GMU21A/S_91/IDFG0035_20190606_082306', 'dataset': 'idfg_swwlf_2019', 'images': [{'datetime': '2019-06-06 08:23:06', 'op_state': 'maintenance', 'file': 'R7/GMU21A/S_91/IDFG0035_20190606_082306_MD_1.JPG', 'frame_num': 1, 'class': ['human'], 'count': 0}], 'trigger': 'M', 'location': 'IDFG0035'}\n",
      "\n",
      "[{'seq_id': 'dummy_R7/GMU36/F_1953/IDFG2348_20190903_212000', 'dataset': 'idfg_swwlf_2019', 'images': [{'datetime': '2019-09-03 21:20:00', 'op_state': 'normal', 'file': 'R7/GMU36/F_1953/IDFG2348_20190903_212000_TL_0.JPG', 'class': ['empty'], 'count': 0}], 'trigger': 'T', 'location': 'IDFG2348'}]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "sequences = process_sequences(embedded, dataset_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# re-start \n",
    "\n",
    "with open('/Users/siyuyang/OneDrive - Microsoft/AI4Earth/CameraTrap/Databases/megadb_2020/temp_idfg_swwlf_2019_megadb.json') as f:\n",
    "    sequences = json.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "species_present = set()\n",
    "for seq in sequences:\n",
    "    for im in seq['images']:\n",
    "        if im['class'][0] != 'empty':\n",
    "            species_present.update(im['class'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# drop images with complicated frame_num (1b, 2b)\n",
    "dropped_images = []\n",
    "\n",
    "for seq in sequences:\n",
    "    \n",
    "    if len(seq['images']) == 1:\n",
    "        continue\n",
    "        \n",
    "    if seq['trigger'] == 'T':\n",
    "        dropped_images.append(seq['images'][1])\n",
    "    \n",
    "    seq['images'] = [seq['images'][0]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "14752"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(dropped_images)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Pass the schema check\n",
    "\n",
    "Once your metadata are in the MegaDB format for `sequence` items, we check that they conform to the format's schema.\n",
    "\n",
    "If the format conforms, the following messages will be printed:\n",
    "\n",
    "```\n",
    "Verified that the sequence items meet requirements not captured by the schema.\n",
    "Verified that the sequence items conform to the schema.\n",
    "```\n",
    "\n",
    "For large datasets, the second step will take some time (~ a minute). \n",
    "\n",
    "Otherwise there will be an error message describing what's wrong. Please fix the issues until all checks are passed. You might need to write some snippets of code to loop through the `sequence` items to understand which entries have problems."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Verified that the sequence items meet requirements not captured by the schema.\n",
      "Verified that the sequence items conform to the schema.\n",
      "CPU times: user 23min 20s, sys: 55.5 s, total: 24min 15s\n",
      "Wall time: 25min 25s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "sequences_schema_check.sequences_schema_check(sequences)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 - Save the `sequence` items to a file\n",
    "\n",
    "You can now take the resulting JSON file to the .Net application for bulk insertion to the database:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(path_to_output, 'w') as f:\n",
    "    json.dump(sequences, f)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:cameratraps] *",
   "language": "python",
   "name": "conda-env-cameratraps-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
